Abstract

We address the online linear optimization problem when the actions of the forecaster are represented by
binary vectors. Our goal is to understand the magnitude of the minimax regret for the worst possible set of
actions. We study the problem under three different assumptions for the feedback: full information, and the
partial information models of the so-called "semi-bandit" and "bandit" problems. We consider both $L_\infty$-
and $L_2$-type restrictions for the losses assigned by the adversary.

We formulate a general strategy using Bregman projections on top of a potential-based gradient descent,
which generalizes the ones studied in the series of papers György et al. (2007), Dani et al. (2008), Abernethy
et al. (2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et al. (2010), Uchiya
et al. (2010), Kale et al. (2010) and Audibert and Bubeck (2010). We provide simple proofs that recover
most of the previous results. We propose new upper bounds for the semi-bandit game. Moreover we derive
lower bounds for all three feedback assumptions. With the only exception of the bandit game, the upper and
lower bounds are tight, up to a constant factor. Finally, we answer a question asked by Koolen et al. (2010)
by showing that the exponentially weighted average forecaster is suboptimal against $L_\infty$ adversaries.

1 Introduction 

In the sequential decision making problems considered in this paper, at each time instance $t = 1, \dots, n$,
the forecaster chooses, possibly in a randomized way, an action from a given set $S$, where $S$ is a subset
of the $d$-dimensional hypercube $\{0,1\}^d$. The action chosen by the forecaster at time $t$ is denoted by
$V_t = (V_{1,t}, \dots, V_{d,t}) \in S$. Simultaneously to the forecaster, the adversary chooses a loss vector
$\ell_t = (\ell_{1,t}, \dots, \ell_{d,t}) \in [0,+\infty)^d$ and the loss incurred by the forecaster is $\ell_t^T V_t$. The goal of the
forecaster is to minimize the expected cumulative loss $E \sum_{t=1}^n \ell_t^T V_t$, where the expectation is taken with
respect to the forecaster's internal randomization. This problem is an instance of an "online linear
optimization" problem¹; see, e.g., Awerbuch and Kleinberg (2004), McMahan and Blum (2004), Kalai and
Vempala (2005), György et al. (2007), Dani et al. (2008), Abernethy et al. (2008), Cesa-Bianchi and Lugosi
(2009), Helmbold and Warmuth (2009), Koolen et al. (2010), Uchiya et al. (2010) and Kale et al. (2010).

Parameters: set of actions $S \subset \{0,1\}^d$; number of rounds $n \in \mathbb{N}$.
For each round $t = 1, 2, \dots, n$:

(1) the forecaster chooses $V_t \in S$ with the help of an external randomization;

(2) simultaneously the adversary selects a loss vector $\ell_t \in [0,+\infty)^d$ (without revealing it);

(3) the forecaster incurs the loss $\ell_t^T V_t$. He observes

- the loss vector $\ell_t$ in the full information game,

- the coordinates $\ell_{i,t} 1_{V_{i,t}=1}$ in the semi-bandit game,

- the instantaneous loss $\ell_t^T V_t$ in the bandit game.

Goal: The forecaster tries to minimize his cumulative loss $\sum_{t=1}^n \ell_t^T V_t$.

Figure 1: Combinatorial prediction games.

We consider three variants of the problem, distinguished by the type of information that becomes available
to the forecaster at each time instance, after taking an action. (1) In the full information game the forecaster
observes the entire loss vector $\ell_t$; (2) in the semi-bandit game only those components $\ell_{i,t}$ of $\ell_t$ are observable
for which $V_{i,t} = 1$; (3) in the bandit game only the total loss $\ell_t^T V_t$ becomes available to the forecaster.

We refer to these problems as combinatorial prediction games. All three prediction games are sketched
in Figure 1. For all three games, we define the regret² of the forecaster as

$$R_n = E \sum_{t=1}^n \ell_t^T V_t - \min_{v \in S} E \sum_{t=1}^n \ell_t^T v.$$
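The three feedback models and the regret can be illustrated with a small sketch. The function names `feedback` and `realized_regret` are hypothetical and chosen only for this illustration; the regret computed here is the realized regret of one fixed sequence of plays, whereas $R_n$ in the text is defined in expectation.

```python
def feedback(loss, action, game):
    """Return what the forecaster observes in one round.

    loss   : sequence of d nonnegative floats (the adversary's loss vector)
    action : sequence of d bits (the chosen element of S)
    game   : "full", "semi-bandit" or "bandit"
    """
    if game == "full":
        # the entire loss vector is revealed
        return list(loss)
    if game == "semi-bandit":
        # only the coordinates with action[i] == 1 are revealed
        return [l if v == 1 else None for l, v in zip(loss, action)]
    if game == "bandit":
        # only the total loss of the chosen action is revealed
        return sum(l * v for l, v in zip(loss, action))
    raise ValueError("unknown game: " + game)


def realized_regret(losses, actions, S):
    """Cumulative loss of the played actions minus that of the best fixed action in S."""
    def dot(l, v):
        return sum(li * vi for li, vi in zip(l, v))

    incurred = sum(dot(l, v) for l, v in zip(losses, actions))
    best_fixed = min(sum(dot(l, s) for l in losses) for s in S)
    return incurred - best_fixed
```

In the bandit game the forecaster receives a single number per round, while in the semi-bandit game it receives one number per selected coordinate.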

In order to make meaningful statements about the regret, one needs to restrict the possible loss vectors 
the adversary may assign. We work with two different natural assumptions that have been considered in the 
literature: 

$L_\infty$ assumption: here we assume that $\|\ell_t\|_\infty \le 1$ for all $t = 1, \dots, n$.
$L_2$ assumption: assume that $\ell_t^T v \le 1$ for all $t = 1, \dots, n$ and $v \in S$.

Note that, without loss of generality, we may assume that for all $i \in \{1, \dots, d\}$, there exists $v \in S$ with
$v_i = 1$, and then the $L_2$ assumption implies the $L_\infty$ assumption.

The goal of this paper is to study the minimax regret, that is, the performance of the forecaster that 
minimizes the regret for the worst possible sequence of loss assignments. This, of course, depends on the set 
S of actions. Our aim is to determine the order of magnitude of the minimax regret for the most difficult set 
to learn. More precisely, for a given game, if we write sup for the supremum over all allowed adversaries
(that is, either $L_\infty$ or $L_2$ adversaries) and inf for the infimum over all forecaster strategies for this game, we
are interested in the maximal minimax regret

$$\overline{R}_n = \max_{S \subset \{0,1\}^d} \inf \sup R_n.$$

¹ In online linear optimization problems, the action set is often not restricted to be a subset of $\{0,1\}^d$ but can be an arbitrary subset
of $\mathbb{R}^d$. However, in the most interesting cases, actions are naturally represented by Boolean vectors and we restrict our attention to this
case.

² For the full information game, one can directly upper bound the stronger notion of regret $E \sum_{t=1}^n \ell_t^T V_t - E \min_{v \in S} \sum_{t=1}^n \ell_t^T v$,
which is always larger than $R_n$. However, for partial information games, this requires more work.
Table 1: Bounds on $\overline{R}_n$ proved in this paper (up to constant factor). The new results are set in bold.

Table 2: Upper bounds on $R_n$ for specific forecasters. The new results are in bold. We also show that the
bound for EXP2 in the full information game is unimprovable. Note that the bound for (Bandit, LINEXP)
is very weak. The bounds with * become $\sqrt{dn \log d}$ if we restrict our attention to sets $S$ that are "almost
symmetric" in the sense that for some $k$, $S \subset \{v \in \{0,1\}^d : \sum_{i=1}^d v_i \le k\}$ and $\mathrm{Conv}(S) \cap [\frac{k}{2d}; 1]^d \neq \emptyset$.

Note that in this paper we do not restrict our attention to computationally efficient algorithms. The 
following example illustrates the different games that we introduced above. 

Example 1 Consider the well studied example of path planning in which, at every time instance, the fore- 
caster chooses a path from one fixed vertex to another in a graph. At each time, a loss is assigned to every 
edge of the graph and, depending on the model of the feedback, the forecaster observes either the losses of 
all edges, the losses of each edge on the chosen path, or only the total loss of the chosen path. The goal is 
to minimize the total loss for any sequence of loss assignments. This problem can be cast as a combinatorial 
prediction game in dimension $d$, where $d$ is the number of edges in the graph.
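To make the encoding of Example 1 concrete, the following minimal sketch enumerates the paths of a small directed graph as edge-indicator vectors in $\{0,1\}^d$. The helper `simple_paths` and the five-edge graph in the usage note are assumptions made for illustration only; brute-force enumeration is of course only feasible for tiny graphs.

```python
def simple_paths(edges, source, target):
    """Enumerate simple source->target paths as 0/1 edge-indicator tuples.

    edges : list of (tail, head) pairs; coordinate i of an action refers to edges[i].
    """
    d = len(edges)
    actions = []

    def walk(node, visited, used):
        if node == target:
            actions.append(tuple(1 if i in used else 0 for i in range(d)))
            return
        for i, (tail, head) in enumerate(edges):
            if tail == node and head not in visited:
                walk(head, visited | {head}, used | {i})

    walk(source, {source}, frozenset())
    return actions


def path_loss(edge_losses, action):
    """Loss of a path: the sum of the losses of its edges, i.e. an inner product."""
    return sum(l * v for l, v in zip(edge_losses, action))
```

For the graph with edges `[("s","a"), ("s","b"), ("a","t"), ("b","t"), ("a","b")]`, the action set consists of three binary vectors of dimension 5, one per path from `"s"` to `"t"`.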

Our contribution is threefold. First, we propose a variant of the algorithm used to track the best linear
predictor (Herbster and Warmuth, 1998) that is well-suited to our combinatorial prediction games. This leads
to an algorithm called CLEB that generalizes various approaches that have been proposed. This new point
of view on algorithms that were defined for specific games (only the full information game, or only the
standard multi-armed bandit game) allows us to generalize them easily to all combinatorial prediction games,
leading to new algorithms such as LINPOLY. This algorithmic contribution leads to our second main result,
the improvement of the known upper bounds for the semi-bandit game. This point of view also leads to a
different proof of the minimax $\sqrt{nd}$ regret bound in the standard $d$-armed bandit game that is much simpler
than the one provided in Audibert and Bubeck (2010). A summary of the bounds proved in this paper can be
found in Table 1 and Table 2. In addition we prove several lower bounds. First, we establish lower bounds on
the minimax regret in all three games and under both types of adversaries, whereas only the cases ($L_2$/$L_\infty$,
Full Information) and ($L_2$, Bandit) were previously treated in the literature. Moreover we also answer a
question of Koolen et al. (2010) by showing that the traditional exponentially weighted average forecaster is
suboptimal against $L_\infty$ adversaries.

In particular, this paper leads to the following (perhaps unexpected) conclusions: 

• The full information game is as hard as the semi-bandit game. More precisely, in terms of $\overline{R}_n$, the price
that one pays for the limited feedback of the semi-bandit game compared to the full information game
is only a constant factor (or a $\sqrt{\log d}$ factor for the $L_2$ setting).

• In the full information and semi-bandit game, the traditional exponentially weighted average forecaster
is provably suboptimal for $L_\infty$ adversaries while it is optimal for $L_2$ adversaries in the full information
game.

• Denote by $A_2$ (respectively $A_\infty$) the set of adversaries that satisfy the $L_2$ assumption (respectively the
$L_\infty$ assumption). We clearly have $A_2 \subset A_\infty \subset d A_2$. We prove that, in the full information game,
$\overline{R}_n$ gains an additional factor of $\sqrt{d}$ at each inclusion. In the semi-bandit game, we show that the same
statement remains true up to a logarithmic factor.



Notation. The convex hull of $S$ is denoted $\mathrm{Conv}(S)$.



2 Combinatorial learning with Bregman projections 

In this section we introduce a general forecaster that we call CLEB (Combinatorial LEarning with Bregman
projections). Every forecaster investigated in this paper is a special case of CLEB.
Let $\mathcal{D}$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\mathrm{Int}(\mathcal{D})$ and boundary $\partial\mathcal{D}$.

Definition 1 We call Legendre any function $F : \mathcal{D} \to \mathbb{R}$ such that

(i) $F$ is strictly convex and admits continuous first partial derivatives on $\mathrm{Int}(\mathcal{D})$;

(ii) for any $u \in \partial\mathcal{D}$, for any $v \in \mathrm{Int}(\mathcal{D})$, we have

$$\lim_{s \to 0, s > 0} (u - v)^T \nabla F\big((1-s)u + sv\big) = +\infty.$$

The Bregman divergence $D_F : \mathcal{D} \times \mathrm{Int}(\mathcal{D}) \to \mathbb{R}$ associated to a Legendre function $F$ is defined by

$$D_F(u, v) = F(u) - F(v) - (u - v)^T \nabla F(v).$$
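As a quick numerical illustration of the definition of $D_F$, the sketch below evaluates the Bregman divergence of the unnormalized negative entropy $F(u) = \sum_i u_i \log u_i$ and compares it with the generalized Kullback-Leibler divergence $\sum_i (u_i \log(u_i/v_i) - u_i + v_i)$. The function names are hypothetical, and the learning-rate scaling $1/\eta$ used later in the text is omitted here.

```python
import math


def bregman(F, grad_F, u, v):
    """D_F(u, v) = F(u) - F(v) - (u - v)^T grad F(v)."""
    g = grad_F(v)
    return F(u) - F(v) - sum((ui - vi) * gi for ui, vi, gi in zip(u, v, g))


def neg_entropy(u):
    # convention 0 * log 0 = 0
    return sum(x * math.log(x) for x in u if x > 0)


def neg_entropy_grad(u):
    # only defined in the interior (all coordinates positive)
    return [math.log(x) + 1.0 for x in u]


def generalized_kl(u, v):
    return sum((ui * math.log(ui / vi) if ui > 0 else 0.0) - ui + vi
               for ui, vi in zip(u, v))
```

The second argument must lie in the interior of the domain, which mirrors the fact that $D_F$ is defined on $\mathcal{D} \times \mathrm{Int}(\mathcal{D})$.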

We consider the algorithm CLEB described in Figure 2. The basic idea is to use a potential-based gradient
descent (1) followed by a projection (2) with respect to the Bregman divergence of the potential onto the
convex hull of $S$ to ensure that the resulting weight vector $w_{t+1}$ can be viewed as $w_{t+1} = E_{V \sim p_{t+1}} V$ for
some distribution $p_{t+1}$ on $S$. The combination of Bregman projections with potential-based gradient descent
was first used in Herbster and Warmuth (1998). Online learning with Bregman divergences without the
projection step has a long history (see Section 11.11 of Cesa-Bianchi and Lugosi (2006)). As discussed
below, CLEB may be viewed as a generalization of the forecasters LINEXP and INF.

The Legendre conjugate $F^*$ of $F$ is defined by $F^*(u) = \sup_{v \in \mathcal{D}} \{u^T v - F(v)\}$. The following theorem
establishes the first step of all upper bounds for the regret of CLEB.



Theorem 2 CLEB satisfies, for any $u \in \mathrm{Conv}(S) \cap \mathcal{D}$,

$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \sum_{t=1}^n D_{F^*}\big(\nabla F(w_t) - \tilde\ell_t, \nabla F(w_t)\big). \quad (3)$$

Proof By applying the definition of the Bregman divergences (or equivalently using Lemma 11.1 of
Cesa-Bianchi and Lugosi (2006)), we obtain

$$\tilde\ell_t^T w_t - \tilde\ell_t^T u = (u - w_t)^T \big(\nabla F(w'_{t+1}) - \nabla F(w_t)\big) = D_F(u, w_t) + D_F(w_t, w'_{t+1}) - D_F(u, w'_{t+1}).$$

By the Pythagorean theorem (Lemma 11.3 of Cesa-Bianchi and Lugosi (2006)), we have
$D_F(u, w'_{t+1}) \ge D_F(u, w_{t+1}) + D_F(w_{t+1}, w'_{t+1})$, hence

$$\tilde\ell_t^T w_t - \tilde\ell_t^T u \le D_F(u, w_t) + D_F(w_t, w'_{t+1}) - D_F(u, w_{t+1}) - D_F(w_{t+1}, w'_{t+1}).$$

Summing over $t$ then gives

$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) - D_F(u, w_{n+1}) + \sum_{t=1}^n \big( D_F(w_t, w'_{t+1}) - D_F(w_{t+1}, w'_{t+1}) \big). \quad (4)$$

By the nonnegativity of the Bregman divergences, we get

$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \sum_{t=1}^n D_F(w_t, w'_{t+1}).$$

From Proposition 11.1 of Cesa-Bianchi and Lugosi (2006), we have
$D_F(w_t, w'_{t+1}) = D_{F^*}\big(\nabla F(w_t) - \tilde\ell_t, \nabla F(w_t)\big)$, which concludes the proof. ■

Parameters:

• a Legendre function $F$ defined on $\mathcal{D}$ with $\mathrm{Conv}(S) \cap \mathrm{Int}(\mathcal{D}) \neq \emptyset$

• $w_1 \in \mathrm{Conv}(S) \cap \mathrm{Int}(\mathcal{D})$

For each round $t = 1, 2, \dots, n$:

(a) Let $p_t$ be a distribution on the set $S$ such that $w_t = E_{V \sim p_t} V$.

(b) Draw a random action $V_t$ according to the distribution $p_t$ and observe

- the loss vector $\ell_t$ in the full information game,

- the coordinates $\ell_{i,t} 1_{V_{i,t}=1}$ in the semi-bandit game,

- the instantaneous loss $\ell_t^T V_t$ in the bandit game.

(c) Estimate the loss $\ell_t$ by $\tilde\ell_t$. For instance, one may take

- $\tilde\ell_t = \ell_t$ in the full information game,

- $\tilde\ell_{i,t} = \frac{\ell_{i,t}}{w_{i,t}} V_{i,t}$ in the semi-bandit game,

- $\tilde\ell_t = P_t^+ V_t V_t^T \ell_t$, with $P_t = E_{v \sim p_t}(v v^T)$, in the bandit game.

(d) Let $w'_{t+1} \in \mathrm{Int}(\mathcal{D})$ satisfying

$$\nabla F(w'_{t+1}) = \nabla F(w_t) - \tilde\ell_t. \quad (1)$$

(e) Project the weight vector $w'_{t+1}$ defined by (1) on the convex hull of $S$:

$$w_{t+1} \in \mathrm{argmin}_{w \in \mathrm{Conv}(S) \cap \mathrm{Int}(\mathcal{D})} D_F(w, w'_{t+1}). \quad (2)$$

Figure 2: Combinatorial learning with Bregman projections (CLEB).

As we will see below, by the equality $E \sum_{t=1}^n \tilde\ell_t^T V_t = E \sum_{t=1}^n \tilde\ell_t^T w_t$, and provided that $\tilde\ell_t^T V_t$ and $\tilde\ell_t^T u$
are unbiased estimates of $E \ell_t^T V_t$ and $E \ell_t^T u$, Theorem 5 leads to an upper bound on the regret $R_n$ of CLEB,
which allows us to obtain the bounds of Table 2 by using appropriate choices of $F$. Moreover, if $F$ admits a
Hessian, denoted $\nabla^2 F$, that is always invertible, then one can prove that up to a third-order term (in $\tilde\ell_t$), the
regret bound can be written as:

$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \lesssim D_F(u, w_1) + \sum_{t=1}^n \tilde\ell_t^T \big(\nabla^2 F(w_t)\big)^{-1} \tilde\ell_t. \quad (5)$$

In this paper, we restrict our attention to the combinatorial learning setting in which $S$ is a subset of
$\{0,1\}^d$. However, one should note that this specific form of $S$ plays no role in the definition of CLEB,
meaning that the algorithm in Figure 2 can be used to handle general online linear optimization problems,
where $S$ is any subset of $\mathbb{R}^d$.

Figure 3: The figure sketches the relationship of the algorithms studied in this paper with arrows representing
"is a special case of". Dotted arrows indicate that the link is obtained by "expanding" $S$, that is, seeing $S$ as
the set of basis vectors in $\mathbb{R}^{|S|}$ rather than seeing it as a (structured) subset of $\{0,1\}^d$ (see Section 3.1). The
six algorithms on the bottom use a Legendre function with a diagonal Hessian. On the contrary, the FTRL
algorithm (see Section 3.3) may consider Legendre functions more adapted to the geometry of the convex
hull of $S$. POLYINF is the algorithm considered in Theorem 22.
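The unbiasedness requirement on the loss estimates can be checked exactly on a small example. The sketch below, with hypothetical helper names, computes the marginals $w = E_{V \sim p} V$ of a distribution $p$ on $S$ and verifies that the semi-bandit estimate $\tilde\ell_i = \ell_i V_i / w_i$ has expectation $\ell_i$. It assumes every coordinate has positive marginal probability.

```python
def marginals(S, p):
    """w_i = sum over v in S of p(v) * v_i, i.e. w = E_{V ~ p} V."""
    d = len(S[0])
    return [sum(pv * v[i] for v, pv in zip(S, p)) for i in range(d)]


def semi_bandit_estimate(loss, action, w):
    """Importance-weighted estimate: loss_i / w_i on played coordinates, 0 elsewhere."""
    return [l / wi if v == 1 else 0.0 for l, v, wi in zip(loss, action, w)]


def expected_estimate(S, p, loss):
    """Exact expectation of the estimate when the action is drawn from p."""
    w = marginals(S, p)
    expectation = [0.0] * len(loss)
    for v, pv in zip(S, p):
        estimate = semi_bandit_estimate(loss, v, w)
        for i, e in enumerate(estimate):
            expectation[i] += pv * e
    return expectation
```

The estimate only uses the losses of coordinates with $V_i = 1$, which is exactly the information available in the semi-bandit game.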

3 Different instances of CLEB

In this section we describe several instances of CLEB and relate them to existing algorithms. Figure 3
summarizes the relationship between the various algorithms introduced below.



3.1 EXP2 (Expanded Exponentially weighted average forecaster)

The simplest approach to combinatorial prediction games is to consider each vertex of $S$ as an independent
expert, and then apply a strategy designed for the expert problem. We call EXP2 the resulting strategy when
one uses the traditional exponentially weighted average forecaster (also called Hedge, Freund and Schapire
(1997)), see Figure 4. In the full information game, EXP2 corresponds to Expanded Hedge defined in Koolen
et al. (2010), where it was studied under the $L_\infty$ assumption. It was also studied in the full information game
under the $L_2$ assumption in Dani et al. (2008). In the semi-bandit game, EXP2 was studied in György et al.
(2007) under the $L_\infty$ assumption. Finally in the bandit game, EXP2 corresponds to the strategy proposed by
Dani et al. (2008) and also to the ComBand strategy, studied under the $L_\infty$ assumption in Cesa-Bianchi and
Lugosi (2009) and under the $L_2$ assumption in Cesa-Bianchi and Lugosi (2010). (These last strategies differ
in how the losses are estimated.)

EXP2 is a CLEB strategy in dimension $|S|$ that uses $\mathcal{D} = [0,+\infty)^{|S|}$ and the function
$F : u \mapsto \frac{1}{\eta} \sum_{i=1}^{|S|} u_i \log(u_i)$, for some $\eta > 0$ (this can be proved by using the fact that the Kullback-Leibler
projection on the simplex is equivalent to an $L_1$-normalization). The following theorem shows the regret bound
that one can obtain for EXP2 (for instance with Theorem 5 applied to the case where $S$ is replaced by
$S' = \{u \in \{0,1\}^{|S|} : \sum_{v \in S} u_v = 1\}$).
EXP2:

Parameter: Learning rate $\eta$.
Let $w_1 = (\frac{1}{|S|}, \dots, \frac{1}{|S|}) \in \mathbb{R}^{|S|}$.
For each round $t = 1, 2, \dots, n$:

(a) Let $p_t$ be the distribution on $S$ such that $p_t(v) = w_{v,t}$ for any $v \in S$.

(b) Play $V_t \sim p_t$ and observe

- the loss vector $\ell_t$ in the full information game,

- the coordinates $\ell_{i,t} 1_{V_{i,t}=1}$ in the semi-bandit game,

- the instantaneous loss $\ell_t^T V_t$ in the bandit game.

(c) Estimate the loss vector $\ell_t$ by $\tilde\ell_t$. For instance, one may take

- $\tilde\ell_t = \ell_t$ in the full information game,

- $\tilde\ell_{i,t} = \frac{\ell_{i,t}}{\sum_{v \in S : v_i = 1} w_{v,t}} V_{i,t}$ in the semi-bandit game,

- $\tilde\ell_t = P_t^+ V_t V_t^T \ell_t$, with $P_t = E_{v \sim p_t}(v v^T)$, in the bandit game.

(d) Update the weights, for all $v \in S$,

$$w_{v,t+1} = \frac{\exp(-\eta \tilde\ell_t^T v) \, w_{v,t}}{\sum_{u \in S} \exp(-\eta \tilde\ell_t^T u) \, w_{u,t}}.$$

Figure 4: EXP2 forecaster.
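The EXP2 forecaster can be sketched in a few lines for the full information game, where the estimate is the loss vector itself. This is only an illustration under the assumption that $S$ is small enough to be enumerated explicitly; the function name `exp2_full_information` is hypothetical, and the expected loss is computed exactly instead of sampling $V_t$.

```python
import math


def exp2_full_information(S, losses, eta):
    """Run exponential weights over the elements of S.

    Returns the expected cumulative loss of the forecaster and the final weights.
    """
    w = [1.0 / len(S)] * len(S)          # uniform initial weights
    expected_loss = 0.0
    for loss in losses:
        # loss of every action: inner product of the loss vector with the action
        action_loss = [sum(l * vi for l, vi in zip(loss, v)) for v in S]
        expected_loss += sum(wv * lv for wv, lv in zip(w, action_loss))
        # multiplicative update followed by L1 normalization
        unnormalized = [wv * math.exp(-eta * lv) for wv, lv in zip(w, action_loss)]
        z = sum(unnormalized)
        w = [x / z for x in unnormalized]
    return expected_loss, w
```

Each element of $S$ is treated as an independent expert, so the cost per round grows linearly with $|S|$, which can be exponential in $d$.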



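Theorem 3 For the EXP2 forecaster, provided that $E \tilde\ell_t = \ell_t$, we have

$$R_n \le \frac{\log |S|}{\eta} + \frac{\eta}{2} \sum_{t=1}^n \sum_{v \in S} E\big[ p_t(v) (\tilde\ell_t^T v)^2 \big].$$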



3.2 LINEXP (Linear Exponentially weighted average forecaster)

We call LINEXP the CLEB strategy that uses $\mathcal{D} = [0,+\infty)^d$ and the function $F : u \mapsto \frac{1}{\eta} \sum_{i=1}^d u_i \log(u_i)$
associated to the Kullback-Leibler divergence, for some $\eta > 0$. In the full information game, LINEXP
corresponds to Component Hedge defined in Koolen et al. (2010), where it was studied under the $L_\infty$ assumption.
In the semi-bandit game, LINEXP was studied in Uchiya et al. (2010) and Kale et al. (2010) under the $L_\infty$
assumption, and for the particular set $S$ with all vertices of $L_1$ norm equal to some value $k$.
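For this choice of $F$, the gradient step (1) of CLEB has a closed form: since $\nabla F(w)_i = (\log w_i + 1)/\eta$, solving $\nabla F(w') = \nabla F(w) - \tilde\ell$ gives the multiplicative update $w'_i = w_i \exp(-\eta \tilde\ell_i)$. The sketch below checks this numerically. The function names are hypothetical, and the Bregman projection onto $\mathrm{Conv}(S)$, which depends on the structure of $S$, is not implemented.

```python
import math


def grad_F(u, eta):
    """Gradient of F(u) = (1/eta) * sum_i u_i log u_i at an interior point u."""
    return [(math.log(ui) + 1.0) / eta for ui in u]


def linexp_unprojected_step(w, loss_estimate, eta):
    """Closed-form solution of grad F(w') = grad F(w) - loss_estimate."""
    return [wi * math.exp(-eta * li) for wi, li in zip(w, loss_estimate)]
```

The result $w'$ generally lies outside $\mathrm{Conv}(S)$, which is why step (e) of CLEB projects it back.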

3.3 FTRL (Follow the Regularized Leader)

If $\mathrm{Conv}(S) \subset \mathcal{D}$ and $w_1 \in \mathrm{argmin}_{w \in \mathcal{D}} F(w)$, steps (d) and (e) are equivalent to

$$w_{t+1} \in \mathrm{argmin}_{w \in \mathrm{Conv}(S)} \Big( \sum_{s=1}^t \tilde\ell_s^T w + F(w) \Big),$$

showing that in this case CLEB can be interpreted as a regularized follow-the-leader algorithm. This type of
algorithm was studied in Abernethy and Rakhlin (2009) in the full information and bandit setting (see also
the lecture notes Rakhlin and Tewari (2008)). A survey of FTRL strategies for the full information game can
be found in Hazan (2010). In the bandit game, FTRL with $F$ being a self-concordant barrier function and a
different estimate than the one proposed in Figure 2 was studied in Abernethy et al. (2008).



3.4 LININF (Linear Implicitly Normalized Forecaster)

Let $f : \mathbb{R}^d \to \mathbb{R}$. The function $f$ has a diagonal Hessian if and only if it can be written as
$f(u) = \sum_{i=1}^d f_i(u_i)$, for some twice differentiable functions $f_i : \mathbb{R} \to \mathbb{R}$, $i = 1, \dots, d$. The Hessian is called
exchangeable when the functions $f_1'', \dots, f_d''$ are identical. In this case, up to adding an affine function of $u$
(note that this alters neither the Bregman divergence nor CLEB), we have $f(u) = \sum_{i=1}^d g(u_i)$ for
some twice differentiable function $g$. In this section, we consider this type of Legendre functions. To underline
the surprising link³ with the Implicitly Normalized Forecaster proposed in Audibert and Bubeck (2010),
we consider $g$ of the form $x \mapsto \int_\omega^x \psi^{-1}(s) \, ds$, and will refer to the algorithm presented hereafter as LININF.

Definition 4 Let $\omega \ge 0$. A function $\psi : (-\infty, a) \to \mathbb{R}_+^*$ for some $a \in \mathbb{R} \cup \{+\infty\}$ is called an $\omega$-potential if
and only if it is convex, continuously differentiable, and satisfies

$$\lim_{x \to -\infty} \psi(x) = \omega, \qquad \lim_{x \to a} \psi(x) = +\infty, \qquad \psi' > 0, \qquad \int_\omega^{\omega+1} |\psi^{-1}(s)| \, ds < +\infty.$$

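Theorem 5 Let $\omega \ge 0$ and let $\psi$ be an $\omega$-potential function. The function $F$ defined on $\mathcal{D} = [\omega, +\infty)^d$ by
$F(u) = \sum_{i=1}^d \int_\omega^{u_i} \psi^{-1}(s) \, ds$ is Legendre. The associated CLEB satisfies, for any $u \in \mathrm{Conv}(S) \cap \mathcal{D}$,

$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \frac{1}{2} \sum_{t=1}^n \sum_{i=1}^d \tilde\ell_{i,t}^2 \max\Big( \psi'\big(\psi^{-1}(w_{i,t})\big), \psi'\big(\psi^{-1}(w_{i,t}) - \tilde\ell_{i,t}\big) \Big), \quad (6)$$

where for any $(u, v) \in \mathcal{D} \times \mathrm{Int}(\mathcal{D})$,

$$D_F(u, v) = \sum_{i=1}^d \Big( \int_{v_i}^{u_i} \psi^{-1}(s) \, ds - (u_i - v_i) \psi^{-1}(v_i) \Big). \quad (7)$$

In particular, when the estimates $\tilde\ell_{i,t}$ are nonnegative, we have

$$\sum_{t=1}^n \tilde\ell_t^T w_t - \sum_{t=1}^n \tilde\ell_t^T u \le D_F(u, w_1) + \frac{1}{2} \sum_{t=1}^n \sum_{i=1}^d \frac{\tilde\ell_{i,t}^2}{(\psi^{-1})'(w_{i,t})}.$$

Proof It is easy to check that $F$ is a Legendre function and that (7) holds. We also have
$\nabla F^*(u) = (\nabla F)^{-1}(u) = (\psi(u_1), \dots, \psi(u_d))$, hence

$$D_{F^*}(u, v) = \sum_{i=1}^d \Big( \int_{v_i}^{u_i} \psi(s) \, ds - (u_i - v_i) \psi(v_i) \Big).$$

From the Taylor-Lagrange expansion, we have $D_{F^*}(u, v) \le \sum_{i=1}^d \max_{s \in [u_i, v_i]} \frac{1}{2} \psi'(s) (u_i - v_i)^2$. Since the
function $\psi$ is convex, we have $\max_{s \in [u_i, v_i]} \psi'(s) \le \psi'\big(\max(u_i, v_i)\big)$, which gives the desired results. ■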



Note that LINEXP is an instance of LININF with $\psi : x \mapsto \exp(\eta x)$. On the other hand, Audibert and
Bubeck (2010) recommend the choice $\psi(x) = (-\eta x)^{-q}$ with $\eta > 0$ and $q > 1$ since it leads to the minimax
optimal rate $\sqrt{nd}$ for the standard $d$-armed bandit game (while the best bound for Exp3 is of the order of
$\sqrt{nd \log d}$). This corresponds to a function $F$ of the form $F(u) = -\frac{q}{(q-1)\eta} \sum_{i=1}^d u_i^{(q-1)/q}$. We refer to the
corresponding CLEB as LINPOLY. In Appendix A we show that a simple application of Theorem 5 proves that
LINPOLY with $q = 2$ satisfies $R_n \le 2\sqrt{2nd}$. This improves on the bound $R_n \le 8\sqrt{nd}$ obtained in Theorem
11 of Audibert and Bubeck (2010).

³ Detailed in Appendix A.
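For a LININF potential $\psi$, the gradient of $F$ is $\nabla F(w)_i = \psi^{-1}(w_i)$, so the unprojected step (1) reads $w'_i = \psi(\psi^{-1}(w_i) - \tilde\ell_i)$. The sketch below instantiates this for the polynomial potential $\psi(x) = (-\eta x)^{-q}$, whose inverse is $\psi^{-1}(s) = -s^{-1/q}/\eta$. The function names are hypothetical, nonnegative loss estimates are assumed so that the argument of $\psi$ stays negative, and the projection step is not shown.

```python
def psi(x, eta, q):
    """Polynomial potential psi(x) = (-eta * x) ** (-q), defined for x < 0."""
    return (-eta * x) ** (-q)


def psi_inv(s, eta, q):
    """Inverse of psi on (0, +inf): psi_inv(s) = -s ** (-1/q) / eta."""
    return -(s ** (-1.0 / q)) / eta


def linpoly_unprojected_step(w, loss_estimate, eta, q):
    """Coordinate-wise solution of grad F(w') = grad F(w) - loss_estimate."""
    return [psi(psi_inv(wi, eta, q) - li, eta, q)
            for wi, li in zip(w, loss_estimate)]
```

For $q = 2$ the update simplifies to $w'_i = (w_i^{-1/2} + \eta \tilde\ell_i)^{-2}$, so a coordinate with zero estimated loss is left unchanged before projection.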

4 Full Information Game 

This section details the upper bounds of the forecasters EXP2, LINEXP and LINPOLY under the $L_2$ and
$L_\infty$ assumptions for the full information game. All results are gathered in Table 2 (page 3). The proofs can be
found in Appendix B. Up to numerical constants, the results concerning (EXP2, $L_2$ and $L_\infty$) and (LINEXP,
$L_\infty$) appeared or can be easily derived from respectively Dani et al. (2008) and Koolen et al. (2010).

Theorem 6 (LINEXP, $L_\infty$) Under the $L_\infty$ assumption, for LINEXP with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{2/n}$ and
$w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le d\sqrt{2n}.$$

Theorem 7 (LINEXP, $L_2$) Under the $L_2$ assumption, for LINEXP with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{2d/n}$ and
$w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le \sqrt{2nd}.$$

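Theorem 8 (LINPOLY, $L_\infty$) Under the $L_\infty$ assumption, for LINPOLY with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{\frac{2}{q(q-1)n}}$ and
$w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le d \sqrt{\frac{2qn}{q-1}}.$$

Theorem 9 (LINPOLY, $L_2$) Under the $L_2$ assumption, for LINPOLY with $\tilde\ell_t = \ell_t$, $\eta = \sqrt{\frac{2d}{q(q-1)n}}$ and
$w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le \sqrt{\frac{2qdn}{q-1}}.$$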



Theorem 10 (EXP2, $L_\infty$) Under the $L_\infty$ assumption, for EXP2 with $\tilde\ell_t = \ell_t$, we have

$$R_n \le \frac{d \log 2}{\eta} + \frac{\eta n d^2}{2}.$$

In particular for $\eta = \sqrt{\frac{2 \log 2}{nd}}$, we have $R_n \le \sqrt{2 d^3 n \log 2}$.

From Theorem 19 the above upper bound is tight, and consequently there exists $S$ for which the algorithm
EXP2 is not minimax optimal in the full information game under the $L_\infty$ assumption.

Theorem 11 (EXP2, $L_2$) Under the $L_2$ assumption, for EXP2 with $\tilde\ell_t = \ell_t$, we have

$$R_n \le \frac{d \log 2}{\eta} + \frac{\eta n}{2}.$$

In particular for $\eta = \sqrt{\frac{2 d \log 2}{n}}$, we have $R_n \le \sqrt{2 d n \log 2}$.
5 Semi-Bandit Game

This section details the upper bounds of the forecasters EXP2, LINEXP and LINPOLY under the $L_2$ and $L_\infty$
assumptions for the semi-bandit game. These bounds are gathered in Table 2 (page 3). The proofs can be
found in Appendix C. Up to the numerical constant, the result concerning (EXP2, $L_\infty$) appeared in György
et al. (2007) in the context of the online shortest path problem. Uchiya et al. (2010) and Kale et al. (2010)
studied the semi-bandit problem under the $L_\infty$ assumption for action sets of the form
$S = \{v \in \{0,1\}^d : \sum_{i=1}^d v_i = k\}$ for some value $k$. Their common algorithm corresponds to LINEXP and the bounds are of
order $\sqrt{knd \log(d/k)}$. Our upper bounds for the regret of LINEXP extend these results to more general sets
of arms and to the $L_2$ assumption.



Theorem 12 (LINEXP, $L_\infty$) Under the $L_\infty$ assumption, for LINEXP with $\tilde\ell_{i,t} = \ell_{i,t} \frac{V_{i,t}}{w_{i,t}}$, $\eta = \sqrt{2/n}$ and
$w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le d\sqrt{2n}.$$

Since the $L_2$ assumption implies the $L_\infty$ assumption, we also have $R_n \le d\sqrt{2n}$ under the $L_2$ assumption.
Let us now detail how LINEXP behaves for almost symmetric action sets as defined below.

Definition 13 The set $S \subset \{0,1\}^d$ is called almost symmetric if for some $k \in \{1, \dots, d\}$,
$S \subset \{v \in \{0,1\}^d : \sum_{i=1}^d v_i \le k\}$ and $\mathrm{Conv}(S) \cap [\frac{k}{2d}; 1]^d \neq \emptyset$. The integer $k$ is called the order of the symmetry.

The set $S = \{v \in \{0,1\}^d : \sum_{i=1}^d v_i = k\}$ considered in Uchiya et al. (2010) and Kale et al. (2010) is a
particular almost symmetric set.

Theorem 14 (LINEXP, almost symmetric S) Let S be an almost symmetric set of order k £ {1, . . . , d}. 

f) T ).L e f£ = max(log(^),l). 



Consider LINEXP with i 



^,45^7 andw 1 = argmin D F {w, (|, 

w£Conv(S) 



Under the Loo assumption, taking r\ 



'-, we 



have R n < \/2kndC. 



• Under the L 2 assumption, taking r] — ky we have R n < 2\/ ndC. 

In particular, it means that under the L 2 assumption, there is a gain in the regret bound of a factor \J d/C 
when the set of actions is an almost symmetric set of order k. 



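Theorem 15 (LINPOLY, $L_\infty$) Under the $L_\infty$ assumption, for LINPOLY with $\tilde\ell_{i,t} = \ell_{i,t} \frac{V_{i,t}}{w_{i,t}}$, $\eta = \sqrt{\frac{2}{q(q-1)n}}$
and $w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le d \sqrt{\frac{2qn}{q-1}}.$$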

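Theorem 16 (LINPOLY, $L_2$) Under the $L_2$ assumption, for LINPOLY with $\tilde\ell_{i,t} = \ell_{i,t} \frac{V_{i,t}}{w_{i,t}}$, $\eta = \sqrt{\frac{2 d^{1/q}}{q(q-1)n}}$
and $w_1 = \mathrm{argmin}_{w \in \mathrm{Conv}(S)} D_F(w, (1, \dots, 1)^T)$, we have

$$R_n \le \sqrt{\frac{2qn d^{2 - 1/q}}{q-1}}.$$

In particular, for $q = 1 + (\log d)^{-1}$, we have $R_n \le \sqrt{2nde \log(ed)}$.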
Theorem 17 (EXP2, $L_\infty$) Under the $L_\infty$ assumption, for the EXP2 forecaster described in Figure 4 using
$\tilde\ell_{i,t} = \ell_{i,t} \frac{V_{i,t}}{\sum_{v \in S : v_i = 1} w_{v,t}}$, we have

$$R_n \le \frac{d \log 2}{\eta} + \frac{\eta n d^2}{2}.$$

In particular for $\eta = \sqrt{\frac{2 \log 2}{nd}}$, we have $R_n \le \sqrt{2 d^3 n \log 2}$.

The corresponding lower bound is given in Theorem 19.



Theorem 18 (EXP2, $L_2$) Under the $L_2$ assumption, for EXP2 with $\tilde\ell_{i,t} = \ell_{i,t} \frac{V_{i,t}}{\sum_{v \in S : v_i = 1} w_{v,t}}$, we have

$$R_n \le \frac{d \log 2}{\eta} + \frac{\eta n d}{2}.$$

In particular for $\eta = \sqrt{\frac{2 \log 2}{n}}$, we have $R_n \le d\sqrt{2n \log 2}$.

Note that as for LINEXP, we end up upper bounding $\sum_{i=1}^d \ell_{i,t}$ by $d$. In the case of an almost symmetric set
$S$ of order $k$, this sum can be bounded by $2d/k$, while $\log(|S|)$ is upper bounded by $k \log(d + 1)$. So as for
LINEXP, this leads to a regret bound of order $\sqrt{nd \log d}$ when the set of actions is an almost symmetric set.



6 Bandit Game 



The upper bounds for EXP2 in the bandit case proposed in Table 2 (page 3) are extracted from Dani et al.
(2008). The approach proposed by the authors is to use EXP2 in the space described by a barycentric spanner.
More precisely, let $m = \dim(\mathrm{Span}(S))$ and $e_1, \dots, e_m$ be a barycentric spanner of $S$; for instance, take
$(e_1, \dots, e_m) \in \mathrm{argmax}_{(x_1, \dots, x_m) \in S^m} |\det_{\mathrm{Span}(S)}(x_1, \dots, x_m)|$ (see Awerbuch and Kleinberg, 2004). We
introduce the transformations $T_1 : \mathbb{R}^d \to \mathbb{R}^m$ such that for $x \in \mathbb{R}^d$, $T_1(x) = (x^T e_1, \dots, x^T e_m)^T$, and
$T_2 : S \to [-1, 1]^m$ such that for $v \in S$, $v = \sum_{i=1}^m T_{2,i}(v) e_i$. Note that for any $v \in S$, we have
$\ell_t^T v = T_1(\ell_t)^T T_2(v)$. Then the loss estimate for $v \in S$ is

$$\tilde\ell_t^T v = \big( Q_t^+ T_2(V_t) T_2(V_t)^T T_1(\ell_t) \big)^T T_2(v), \quad \text{where } Q_t = E_{V \sim p_t} T_2(V) T_2(V)^T.$$

Moreover the authors also add a forced exploration which is uniform over the barycentric spanner.

A concurrent approach is the one proposed in Cesa-Bianchi and Lugosi (2009, 2010). There the authors
study EXP2 directly in the original space, with the estimate described in Figure 4 and with an additional
forced exploration which is uniform over $S$. They work out several examples of sets $S$ for which they
improve the regret bound by a factor $\sqrt{d}$ with respect to Dani et al. (2008). Unfortunately there exist sets $S$ for
which this approach fails to provide a bound polynomial in $d$. In general one needs to replace the uniform
exploration over $S$ by an exploration that is tailored to this set. How to do this in general is still an open question.

The upper bounds for LINEXP in the bandit case proposed in Table 2 (page 3) are derived by using the
trick of Dani et al. (2008) (that is, by working with a barycentric spanner). The proof of this result is omitted,
since it does not yield the optimal dependency in $n$. Moreover we cannot analyze LINPOLY since (1) is not
well defined in this case, because $\tilde\ell_t$ can be non-positive. In general we believe that the LININF approach is
not sound for the bandit case, and that one needs to work with a Legendre function with non-diagonal Hessian.
The only known CLEB with non-diagonal Hessian is the one proposed in Abernethy et al. (2008), where
the authors use a self-concordant barrier function. In this case, they are able to propose a loss estimate related
to the structure of the Hessian. This approach is powerful, and under the $L_2$ assumption leads to a regret
upper bound of order $d\sqrt{\theta n \log n}$ for $\theta > 0$ such that $\mathrm{Conv}(S)$ admits a $\theta$-self-concordant barrier function
(see Abernethy et al., 2008, section 5). When $\mathrm{Conv}(S)$ admits a $O(1)$-self-concordant barrier function, the
upper bound matches the lower bound $O(d\sqrt{n})$. The open question is to determine for which sets $S$, this
7 Lower Bounds

We start this section with a result that shows that EXP2 is suboptimal against $L_\infty$ adversaries. This answers
a question of Koolen et al. (2010).

Theorem 19 Let $n \ge d$. There exists a subset $S \subset \{0,1\}^d$ such that in the full information game, for the
EXP2 strategy (for any learning rate $\eta$), we have

$$\sup R_n \ge 0.02 \, d^{3/2} \sqrt{n},$$

where the supremum is taken over all $L_\infty$ adversaries.

Proof For the sake of simplicity we assume here that $d$ is a multiple of 4 and that $n$ is even. We consider the
following subset of the hypercube:

$$S = \Big\{ v \in \{0,1\}^d : \sum_{i=1}^{d/2} v_i = d/4 \text{ and } \big( v_i = 1, \forall i \in \{d/2+1, \dots, d/2+d/4\} \text{ or } v_i = 1, \forall i \in \{d/2+d/4+1, \dots, d\} \big) \Big\}.$$

That is, choosing a point in $S$ corresponds to choosing a subset of $d/4$ elements in the first half of the
coordinates, and choosing one of the two disjoint intervals of size $d/4$ in the second half of the coordinates.



We will prove that for any parameter $\eta$, there exists an adversary such that Exp (with parameter $\eta$)
has a regret of at least $\frac{nd}{16} \tanh\big(\frac{\eta d}{8}\big)$, and that there exists another adversary such that its regret is at least
$\min\big( \frac{d \log 2}{12\eta}, \frac{nd}{12} \big)$. As a consequence, we have

$$\sup R_n \ge \max\Big( \frac{nd}{16} \tanh\Big(\frac{\eta d}{8}\Big), \min\Big( \frac{d \log 2}{12\eta}, \frac{nd}{12} \Big) \Big) \ge \min\Big( \max\Big( \frac{nd}{16} \tanh\Big(\frac{\eta d}{8}\Big), \frac{d \log 2}{12\eta} \Big), \frac{nd}{12} \Big) \ge \min\Big( A, \frac{nd}{12} \Big)$$



with 



,'nd /«d\ dlog2 
A = mm max — tanh 



r,e[o,+oo) \16 V 8 / 1277 

. f . nd /??d\ . /nd /"d\ dlog2 

> mm < mm — tanh — , mm max — tanh — , 

\*7d>8l6 V 8 / \d<8 \16 U/ 12t} 

■ \ nd . . . f ndrjd 1 dlog2 

> mm < — tanh 1 , mm max , . . , 

\16 V ' -;d<8 \16 8 tanh(l)' 12n 



>minJ — tanh(l), J W31 ° g2 1 , s \ > min (0.04 nd, 0.02 d 3 / 2 ^). 

| 16 w y 128 x 12 x tanh(l) J ~ v ' v ; 

Let us first prove the lower bound ?| tanh ( ^ ) ■ We define the following adversary: 

( 1 if i G {d/2 + 1; . . . , d/2 + d/4} and t odd, 
li,t=l 1 if i G {d/2 + d/4 + l,...,d} and t even, 
[ otherwise. 

This adversary always puts a zero loss on the first half of the coordinates, and alternates between a loss of $d/4$ for choosing the first interval (in the second half of the coordinates) and the second interval. At the beginning of odd rounds, any vertex $v \in S$ has the same cumulative loss and thus Exp picks its expert uniformly at random, which yields an expected cumulative loss equal to $nd/16$. On the other hand at even rounds the probability distribution to select the vertex $v \in S$ is always the same. More precisely the probability of selecting a vertex which contains the interval $\{d/2+d/4+1, \dots, d\}$ (i.e., the interval with a $d/4$ loss at this round) is exactly $\frac{1}{1+\exp(-\eta d/4)}$. This adds an expected cumulative loss equal to $\frac{nd}{8} \cdot \frac{1}{1+\exp(-\eta d/4)}$. Finally note that the loss of any fixed vertex is $nd/8$. Thus we obtain

$$R_n = \frac{nd}{16} + \frac{nd}{8} \cdot \frac{1}{1+\exp(-\eta d/4)} - \frac{nd}{8} = \frac{nd}{16} \tanh\Big(\frac{\eta d}{8}\Big).$$
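The closed form above can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes $n$ even and $d$ divisible by 4, and it aggregates the vertices of $S$ into two groups (those containing the first interval and those containing the second), which is legitimate here because both groups have the same cardinality and the first half of the coordinates carries zero loss. The function name `exp_regret_alternating` is hypothetical.

```python
import math

def exp_regret_alternating(n, d, eta):
    """Expected regret of Exp against the alternating adversary (sketch).

    Assumes n even and d divisible by 4. Vertices are grouped by the
    interval they contain; both groups have the same number of vertices.
    """
    block = d // 4                 # loss suffered when the chosen interval is hit
    cum = [0.0, 0.0]               # cumulative loss of group 0 / group 1
    expected_loss = 0.0
    for t in range(1, n + 1):
        hit = 0 if t % 2 == 1 else 1          # interval receiving loss at round t
        base = min(cum)                        # shift for numerical stability
        w = [math.exp(-eta * (c - base)) for c in cum]
        p_hit = w[hit] / (w[0] + w[1])         # probability Exp plays the hit group
        expected_loss += block * p_hit
        cum[hit] += block
    return expected_loss - min(cum)            # regret against the best fixed vertex

# Compare with the closed form (nd/16) * tanh(eta * d / 8).
n, d, eta = 10, 8, 0.3
closed_form = n * d / 16 * math.tanh(eta * d / 8)
```

For these illustrative values the simulated regret and `closed_form` agree up to floating-point error.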

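We move now to the dependency in $1/\eta$. Here we consider the adversary defined by:

$$\ell_{i,t} = \begin{cases} 1 - \varepsilon & \text{if } i \le d/4, \\ 1 & \text{if } i \in \{d/4+1, \dots, d/2\}, \\ 0 & \text{otherwise.} \end{cases}$$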

Note that against this adversary the choice of the interval (in the second half of the components) does 
not matter. Moreover by symmetry the weight of any coordinate in {d/4 + 1, . . . , d/2} is the same (at any 
round). Finally remark that this weight is decreasing with t. Thus we have the following identities (in the big 
sums i represents the number of components selected in the first d/4 components): 

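$$\begin{aligned}
R_n &= \mathbb{E} \sum_{t=1}^{n} \sum_{i=d/4+1}^{d/2} \varepsilon V_{i,t} = \frac{\varepsilon d}{4} \sum_{t=1}^{n} \mathbb{E} V_{d/2,t} \\
&\ge \frac{n \varepsilon d}{4} \, \frac{\sum_{v \in S : v_{d/2} = 1} \exp(-\eta n \ell_1^T v)}{\sum_{v \in S} \exp(-\eta n \ell_1^T v)} \\
&= \frac{n \varepsilon d}{4} \, \frac{\sum_{i=0}^{d/4-1} \binom{d/4}{i} \binom{d/4-1}{d/4-i-1} \exp\big(-\eta (nd/4 - i n \varepsilon)\big)}{\sum_{i=0}^{d/4} \binom{d/4}{i} \binom{d/4}{d/4-i} \exp\big(-\eta (nd/4 - i n \varepsilon)\big)} \\
&= \frac{n \varepsilon d}{4} \, \frac{\sum_{i=0}^{d/4-1} \big(1 - \frac{4i}{d}\big) \binom{d/4}{i}^2 \exp(\eta i n \varepsilon)}{\sum_{i=0}^{d/4} \binom{d/4}{i}^2 \exp(\eta i n \varepsilon)},
\end{aligned}$$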



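where we used $\binom{d/4-1}{d/4-i-1} = \big(1 - \frac{4i}{d}\big) \binom{d/4}{i}$ in the last equality. Thus taking $\varepsilon = \min\big(\frac{\log 2}{\eta n}, 1\big)$ yields

$$R_n \ge \min\Big( \frac{d \log 2}{4 \eta}, \frac{nd}{4} \Big) \, \frac{\sum_{i=0}^{d/4-1} \big(1 - \frac{4i}{d}\big) \binom{d/4}{i}^2 \min(2, \exp(\eta n))^i}{\sum_{i=0}^{d/4} \binom{d/4}{i}^2 \min(2, \exp(\eta n))^i} \ge \min\Big( \frac{d \log 2}{12 \eta}, \frac{nd}{12} \Big),$$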



where the last inequality follows from Lemma 23 (see Appendix E). This concludes the proof of the lower bound. ■



The next two theorems give lower bounds under the three feedback assumptions and the two types of adversaries. The cases ($L_2$, Full Information) and ($L_2$, Bandit) already appeared in Dani et al. (2008), while the case ($L_\infty$, Full Information) was treated in Koolen et al. (2010) (with more precise lower bounds for subsets $S$ of particular interest). Note that the lower bounds for the semi-bandit case trivially follow from the ones for the full information game. Thus our main contribution here is the lower bound for ($L_\infty$, Bandit), which is technically quite different from the other cases. We also give explicit constants in all cases.
Theorem 20 Let $n \ge d$. Against $L_\infty$ adversaries in the cases of full information and semi-bandit games, we have

$$R_n \ge 0.008\, d \sqrt{n},$$

and in the bandit game

$$R_n \ge 0.01\, d^{3/2} \sqrt{n}.$$

Proof In this proof we consider the following subset of $\{0,1\}^d$:

$$S = \big\{ v \in \{0,1\}^d : \forall i \in \{1, \dots, \lfloor d/2 \rfloor\}, \ v_{2i-1} + v_{2i} = 1 \big\}.$$

Under full information, playing in $S$ corresponds to playing $\lfloor d/2 \rfloor$ independent standard full information games with 2 experts. Thus we can apply [Theorem 30, Audibert and Bubeck (2010)] to obtain:

$$R_n \ge \lfloor d/2 \rfloor \times 0.03 \sqrt{n \log 2} \ge 0.008\, d \sqrt{n}.$$

We now move to the bandit game, for which the proof is more challenging. For the sake of simplicity, we assume in the following that $d$ is even. Moreover, we restrict our attention to deterministic forecasters; the extension to general forecasters can be done by a routine application of Fubini's theorem.

First step: definitions. 

We denote by $I_{i,t} \in \{1, 2\}$ the random variable such that $V_{2i,t} = 1$ if and only if $I_{i,t} = 2$. That is, $I_{i,t}$ is the expert chosen at time $t$ in the $i$-th game. We also define the empirical distribution of plays $q_n^i = (q_{1,n}^i, q_{2,n}^i)$ in game $i$ as $q_{j,n}^i = \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}_{I_{i,t} = j}$. Let $J_{i,n}$ be drawn according to $q_n^i$.

In this proof we consider a set of $2^{d/2}$ adversaries. For $\alpha = (\alpha_1, \dots, \alpha_{d/2}) \in \{1, 2\}^{d/2}$ we define the $\alpha$-adversary as follows: For any $t \in \{1, \dots, n\}$, the loss of expert $\alpha_i$ in game $i$ is drawn from a Bernoulli of parameter $1/2$ while the loss of the other expert in game $i$ is drawn from a Bernoulli of parameter $1/2 + \varepsilon$. We note $\mathbb{E}_\alpha$ when we integrate with respect to the reward generation process of the $\alpha$-adversary. We note $\mathbb{P}_{i,\alpha}$ the law of $J_{i,n}$ when the forecaster plays against the $\alpha$-adversary. Remark that we have $\mathbb{P}_{i,\alpha}(J_{i,n} = j) = \mathbb{E}_\alpha \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}_{I_{i,t} = j}$, hence, against the $\alpha$-adversary we have:

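$$R_n = \mathbb{E}_\alpha \sum_{t=1}^{n} \sum_{i=1}^{d/2} \varepsilon \mathbb{1}_{I_{i,t} \ne \alpha_i} = n \varepsilon \sum_{i=1}^{d/2} \big( 1 - \mathbb{P}_{i,\alpha}(J_{i,n} = \alpha_i) \big),$$

which implies (since the maximum is larger than the mean)

$$\sup_{\alpha \in \{1,2\}^{d/2}} R_n \ge n \varepsilon \sum_{i=1}^{d/2} \Big( 1 - \frac{1}{2^{d/2}} \sum_{\alpha \in \{1,2\}^{d/2}} \mathbb{P}_{i,\alpha}(J_{i,n} = \alpha_i) \Big). \tag{9}$$

Second step: information inequality.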



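Let $\mathbb{P}_{-i,\alpha}$ be the law of $J_{i,n}$ against the adversary which plays like the $\alpha$-adversary except that in the $i$-th game, the losses of both coordinates are drawn from a Bernoulli of parameter $1/2 + \varepsilon$ (we call it the $(-i, \alpha)$-adversary). Now we use Pinsker's inequality which gives:

$$\mathbb{P}_{i,\alpha}(J_{i,n} = \alpha_i) \le \mathbb{P}_{-i,\alpha}(J_{i,n} = \alpha_i) + \sqrt{\tfrac{1}{2} \mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha})},$$

and thus (thanks to the concavity of the square root)

$$\frac{1}{2^{d/2}} \sum_{\alpha \in \{1,2\}^{d/2}} \mathbb{P}_{i,\alpha}(J_{i,n} = \alpha_i) \le \frac{1}{2} + \sqrt{\frac{1}{2^{d/2+1}} \sum_{\alpha \in \{1,2\}^{d/2}} \mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha})}. \tag{10}$$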
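Third step: computation of $\mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha})$ with the chain rule for Kullback-Leibler divergence.

Note that since the forecaster is deterministic, the sequence of observed losses (up to time $n$) $W_n \in \{0, 1, \dots, d\}^n$ uniquely determines the empirical distribution of plays $q_n^i$, and in particular the law of $J_{i,n}$ conditionally to $W_n$ is the same for any adversary. Thus, if we note $\mathbb{P}_\alpha^n$ (respectively $\mathbb{P}_{-i,\alpha}^n$) the law of $W_n$ when the forecaster plays against the $\alpha$-adversary (respectively the $(-i, \alpha)$-adversary), then one can easily prove that $\mathrm{KL}(\mathbb{P}_{-i,\alpha}, \mathbb{P}_{i,\alpha}) \le \mathrm{KL}(\mathbb{P}_{-i,\alpha}^n, \mathbb{P}_\alpha^n)$. Now we use the chain rule for Kullback-Leibler divergence iteratively to introduce the laws $\mathbb{P}_\alpha^t$ of the observed losses $W_t$ up to time $t$. More precisely, we have,

$$\begin{aligned}
\mathrm{KL}(\mathbb{P}_{-i,\alpha}^n, \mathbb{P}_\alpha^n) &= \mathrm{KL}(\mathbb{P}_{-i,\alpha}^1, \mathbb{P}_\alpha^1) + \sum_{t=2}^{n} \sum_{w_{t-1} \in \{0,1,\dots,d\}^{t-1}} \mathbb{P}_{-i,\alpha}^{t-1}(w_{t-1}) \, \mathrm{KL}\big( \mathbb{P}_{-i,\alpha}^t(\cdot \mid w_{t-1}), \mathbb{P}_\alpha^t(\cdot \mid w_{t-1}) \big) \\
&= \mathrm{KL}(B_\emptyset, B'_\emptyset) \mathbb{1}_{I_{i,1} = \alpha_i} + \sum_{t=2}^{n} \sum_{w_{t-1} : I_{i,t} = \alpha_i} \mathbb{P}_{-i,\alpha}^{t-1}(w_{t-1}) \, \mathrm{KL}(B_{w_{t-1}}, B'_{w_{t-1}}),
\end{aligned}$$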

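where $B_{w_{t-1}}$ and $B'_{w_{t-1}}$ are sums of $d/2$ Bernoulli distributions with parameters in $\{1/2, 1/2 + \varepsilon\}$ and such that the number of Bernoullis with parameter $1/2 + \varepsilon$ in $B_{w_{t-1}}$ is equal to the number of Bernoullis with parameter $1/2 + \varepsilon$ in $B'_{w_{t-1}}$ plus one. Now using Lemma 24 (see Appendix E), each term $\mathrm{KL}(B_{w_{t-1}}, B'_{w_{t-1}})$ can be bounded, which gives a bound on $\mathrm{KL}(\mathbb{P}_{-i,\alpha}^n, \mathbb{P}_\alpha^n)$ proportional to $\mathbb{E}_{-i,\alpha} \sum_{t=1}^{n} \mathbb{1}_{I_{i,t} = \alpha_i}$. Summing and plugging this into (10) we obtain an upper bound on $\frac{1}{2^{d/2}} \sum_{\alpha \in \{1,2\}^{d/2}} \mathbb{P}_{i,\alpha}(J_{i,n} = \alpha_i)$ of the form $\frac{1}{2}$ plus a term of order $\varepsilon \sqrt{n/d}$. To conclude the proof one needs to plug this last bound into (9) along with straightforward computations. ■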

Theorem 21 Let n > d. Against L2 adversaries in the cases of full information and semi-bandit games, we 
have 

R n > OmVdn, 

and in the bandit game 

$$R_n \ge 0.05 \min(n, d\sqrt{n}).$$
Parameters: set of actions $\mathcal{A} = \{1, \dots, d\}$; number of rounds $n \in \mathbb{N}$.
For each round $t = 1, 2, \dots, n$;

(1) the forecaster chooses $A_t \in \mathcal{A}$, with the help of an external randomization;

(2) simultaneously the adversary selects a loss vector $\ell_t = (\ell_{1,t}, \dots, \ell_{d,t})^T \in \mathbb{R}^d$ (without revealing it);

(3) the forecaster incurs the loss $\ell_{A_t,t}$. He observes

- the loss vector $\ell_t$ in the full information game,

- the coordinate $\ell_{A_t,t}$ in the bandit game.

Goal: The forecaster tries to minimize his cumulative loss $\sum_{t=1}^{n} \ell_{A_t,t}$.

Figure 5: Standard prediction games.
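The protocol of Figure 5 can be sketched in a few lines of code. This is only an illustration under stated assumptions: the adversary is taken to be oblivious (the loss sequence is fixed in advance), actions are indexed from 0 rather than 1, and the names `play_standard_game`, `choose` and `uniform_forecaster` are hypothetical, not taken from the paper.

```python
import random

def play_standard_game(losses, choose, feedback="bandit", seed=0):
    """Run the standard prediction game on a fixed loss sequence (sketch).

    losses   : list of n loss vectors, each a list of length d
    choose   : forecaster callback (history, d, rng) -> action in range(d)
    feedback : "full" reveals the whole loss vector after each round,
               "bandit" reveals only the loss of the chosen action
    """
    rng = random.Random(seed)          # the forecaster's external randomization
    d = len(losses[0])
    history = []                       # list of (action, observation) pairs
    cumulative = 0.0
    for loss_vector in losses:
        action = choose(history, d, rng)       # step (1); losses are still hidden
        incurred = loss_vector[action]         # step (3): incurred loss
        cumulative += incurred
        observed = list(loss_vector) if feedback == "full" else incurred
        history.append((action, observed))
    return cumulative, history

def uniform_forecaster(history, d, rng):
    """Baseline forecaster that ignores the feedback and plays uniformly."""
    return rng.randrange(d)

# Usage: three rounds, two actions, bandit feedback.
losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
total, history = play_standard_game(losses, uniform_forecaster)
```

The only difference between the two games is the content of `observed`; the incurred loss is the same in both.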



A Standard prediction games 

It is well-known that the standard prediction games described in Figure 5 are specific cases of the combinatorial prediction games described in Figure 1. Indeed, consider $S = \{a_1, \dots, a_d\}$, where $a_i \in \{0,1\}^d$ is the vector whose only nonzero component is the $i$-th one. The standard and combinatorial prediction games are then equivalent by using $V_t = a_{A_t}$ and noticing that $\ell_t^T a_i = \ell_{i,t}$. In particular, the semi-bandit and bandit combinatorial prediction games are then both equivalent to the traditional multi-armed bandit game.

We now show that INF (defined in Figure 2 of Audibert and Bubeck (2010)) is a special case of LININF.



Proof Indeed, suppose that the estimates $\tilde{\ell}_1, \dots, \tilde{\ell}_n$ are nonnegative (coordinate-wise), and take $w_1 = \big(\frac{1}{d}, \dots, \frac{1}{d}\big)^T$. Then the vector $w'_{t+1}$ satisfying (1) exists, and is defined coordinate-wise by $\psi^{-1}(w'_{i,t+1}) = \psi^{-1}(w_{i,t}) - \tilde{\ell}_{i,t}$. Besides, the optimality condition of (2) implies the existence of $c_t \in \mathbb{R}$ (independent of $i$) such that $\psi^{-1}(w_{i,t+1}) = \psi^{-1}(w'_{i,t+1}) + c_t$. It implies $\psi^{-1}(w_{i,t}) = \psi^{-1}(w_{i,1}) - \sum_{s=1}^{t-1} (\tilde{\ell}_{i,s} - c_s)$ for any $t \ge 1$. So there exists $C_t \in \mathbb{R}$ such that $w_{i,t} = \psi\big( \sum_{s=1}^{t-1} (1 - \tilde{\ell}_{i,s}) - C_t \big)$. Since $w_t \in \mathrm{Conv}(S)$, the constant $C_t$ should satisfy $\sum_{i=1}^{d} w_{i,t} = 1$. We thus recover INF with the estimate $1 - \tilde{\ell}_{i,s}$ of the reward $1 - \ell_{i,s}$. So the Bregman projection has here a simple solution depending on a unique constant $C_t$ obtained by solving the equality $\sum_{i=1}^{d} \psi\big( \sum_{s=1}^{t-1} (1 - \tilde{\ell}_{i,s}) - C_t \big) = 1$. ■
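The normalizing constant in the proof above has no closed form in general, but it is the root of a monotone scalar equation and can be found by bisection. The sketch below is an illustration, not the authors' implementation: it assumes the polynomial potential $\psi(x) = (-\eta x)^{-q}$, takes as input the cumulative estimated rewards $G_i = \sum_{s<t} (1 - \tilde{\ell}_{i,s})$, and uses the hypothetical names `psi` and `inf_weights`. The bracket comes from two observations: at $C = \max_i G_i + 1/\eta$ the largest weight already equals 1, and at $C = \max_i G_i + d^{1/q}/\eta$ every weight is at most $1/d$.

```python
def psi(x, eta, q):
    """Polynomial potential psi(x) = (-eta * x) ** (-q), defined for x < 0."""
    return (-eta * x) ** (-q)

def inf_weights(gains, eta=1.0, q=2.0, iters=200):
    """Weights w_i = psi(G_i - C) with C chosen so that the weights sum to 1.

    gains : cumulative estimated rewards G_i (an assumed input format)
    """
    d = len(gains)
    top = max(gains)
    lo = top + 1.0 / eta                   # here the sum of weights is >= 1
    hi = top + d ** (1.0 / q) / eta        # here the sum of weights is <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(psi(g - mid, eta, q) for g in gains)
        if total > 1.0:
            lo = mid                       # weights too large: increase C
        else:
            hi = mid                       # weights small enough: decrease C
    return [psi(g - hi, eta, q) for g in gains]

# Usage: equal gains give the uniform distribution; larger gains get more weight.
uniform = inf_weights([0.0, 0.0, 0.0, 0.0], eta=1.0, q=2.0)
skewed = inf_weights([3.0, 1.0, 2.0], eta=0.5, q=2.0)
```

Since the sum of weights is strictly decreasing in $C$ on the bracket, the root is unique and bisection converges to it.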



Next we show how to obtain the minimax $\sqrt{nd}$ regret bound, with a much simpler proof than the one proposed in Audibert and Bubeck (2010), as well as a better constant.



Theorem 22 Let $q > 1$. For the INF forecaster (that is for CLEB with $w_1 = \big(\frac{1}{d}, \dots, \frac{1}{d}\big)^T$ and $S = \{a_1, \dots, a_d\}$) using $\psi(x) = (-\eta x)^{-q}$ and $\tilde{\ell}_{i,t} = \ell_{i,t} \frac{V_{i,t}}{w_{i,t}}$, we have

$$R_n \le \frac{q\, d^{1/q}}{\eta (q-1)} + \frac{q \eta n\, d^{1 - 1/q}}{2}.$$

In particular for $\eta = \sqrt{2}\, d^{\frac{1}{q} - \frac{1}{2}} [(q-1) n]^{-1/2}$, we have $R_n \le q \sqrt{\frac{2nd}{q-1}}$.

In view of this last bound, the optimal $q$ is $q = 2$, which leads to $R_n \le 2\sqrt{2nd}$. This improves on the bound $R_n \le 8\sqrt{nd}$ obtained in Theorem 11 of Audibert and Bubeck (2010). The INF forecaster with the above polynomial $\psi$ is referred to as POLYINF in Figure 3.
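The tuning of $\eta$ can be sanity-checked numerically. The sketch below assumes the bound has the two-term form $\frac{q d^{1/q}}{\eta(q-1)} + \frac{q \eta n d^{1-1/q}}{2}$ (our reading of the theorem) and verifies that the stated $\eta$ balances the two terms, giving $q\sqrt{2nd/(q-1)}$, and that $q = 2$ gives $2\sqrt{2nd}$. The function names and the values of $n$ and $d$ are illustrative assumptions.

```python
import math

def inf_bound(n, d, q, eta):
    """Two-term regret bound, in the form assumed in the lead-in."""
    divergence_term = q * d ** (1.0 / q) / (eta * (q - 1.0))
    variance_term = q * eta * n * d ** (1.0 - 1.0 / q) / 2.0
    return divergence_term + variance_term

def tuned_eta(n, d, q):
    """eta = sqrt(2) * d**(1/q - 1/2) * ((q - 1) * n)**(-1/2)."""
    return math.sqrt(2.0) * d ** (1.0 / q - 0.5) / math.sqrt((q - 1.0) * n)

# Illustrative values: with q = 2 the tuned bound equals 2 * sqrt(2 * n * d).
n, d = 1000, 10
bound_q2 = inf_bound(n, d, 2.0, tuned_eta(n, d, 2.0))
target_q2 = 2.0 * math.sqrt(2.0 * n * d)
```

Evaluating `inf_bound` at `tuned_eta` for other values of $q$ shows how the constant $q/\sqrt{q-1}$ degrades away from $q = 2$.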
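Proof We apply (8). First we bound the divergence term. We have $\psi^{-1}(s) = -\frac{1}{\eta} s^{-1/q}$, and $D_F(u, v) = \frac{1}{\eta} \sum_{i=1}^{d} \Big( \frac{1}{q-1} v_i^{\frac{q-1}{q}} - \frac{q}{q-1} u_i^{\frac{q-1}{q}} + u_i v_i^{-\frac{1}{q}} \Big)$, hence

$$\max_{u \in \mathrm{Conv}(S)} D_F(u, w_1) = D_F\Big( (1, 0, \dots, 0)^T, \big(\tfrac{1}{d}, \dots, \tfrac{1}{d}\big)^T \Big) = \frac{q}{\eta (q-1)} \big( d^{1/q} - 1 \big).$$

Combining this with $(\psi^{-1})'(w_{i,t}) = \frac{1}{\eta q} w_{i,t}^{-\frac{q+1}{q}}$ and (8), we obtain

$$R_n \le \frac{q}{\eta (q-1)} \big( d^{1/q} - 1 \big) + \frac{q \eta}{2} \, \mathbb{E} \sum_{t=1}^{n} \sum_{i=1}^{d} w_{i,t}^{\frac{q+1}{q}} \tilde{\ell}_{i,t}^2 \le \frac{q\, d^{1/q}}{\eta (q-1)} + \frac{q \eta n\, d^{1 - 1/q}}{2},$$

where in the last step we use that by Hölder's inequality, $\sum_{i=1}^{d} \big( w_{i,t}^{1/q} \times 1 \big) \le \big( \sum_{i=1}^{d} w_{i,t} \big)^{1/q} \times d^{1 - 1/q}$. ■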



B Proofs of Theorems in Section 4
Proof of Theorem 6

We have D F (u, v) = ~ X^i=i ( M « 1°S (^r) — u » + v ij > hence from the Pythagorean theorem, 

D F (u, Wl ) < D F (u,(l,...,l) T ) < -. 
Since we have ™i,t < 4 Theoremgimplies i?„ < i + §E£™ =1 £? =1 «><,i*i,i < f t + - 



dr/ 
2~" 



Proof of Theorem 

As in the previous proof, we have D F (u,w\) < ^, but under the Li constraint, we can improve the bound 

on Yfi=i w i,t£f,t b y usin g Z)i=i Wi,t£i,t < 1 (since to t € Conv(S)). This gives R n < i + 



Proof of Theorem 8

We have D F {u,v) = (^=i v i " ~ ^T^i ' + u i v i " )> hence D F (u, Wl ) < Since 

have wjt'^t < 1. Theoremlimplies i?„ < ^ + f E £?=i £? =1 «#^,t < ^ 



>?(?-!) 



we 

ndqri 



Proof of Theorem 9

As in the previous proof, we have D F (u,wi) < ^T^rn • Under the L 2 constraint, we can improve the bound 
on £\ =1 t 9 ^ t by using £\ =1 iUj,t4,t < 1 (since io t € Conv(S)). This gives 
Proof of Theorem 10

Using log(|«S|) < d log 2, < tjv < d and J2 v es u>v , t = 1 m Theorem^ we get the result. 

Proof of Theorem 11

Using log(|«S|) < d log 2, < ifv < 1 and J^ves w v,t = 1 in Theorem^ we get the result. 

C Proofs of Theorems in Section |U 
Proof of Theorem 12

We have again Dp(u, Wi) < ~. Since we have X^=i ^i,t _• d, Theoremjsjimplies 

T n d T T , n d , T 

Proof of Theorem [H 

The starting point is Theorem^ which, by using E(^? t ^^-) = E^? t < E^. t , implies 

n d y n d 

Rn < D F (u,wi) + ?E£XXt— < D F (u,w 1 ) + ~E^^£ i)t . 

4=1 i=l 1,1 4=1 i=l 

For any u € [0, l] d such that Yli=i u i — ^> we ^ ave 

Dr (u^)<D F {u,(l ^) T )<i( fc + |:».lo g (^))<f. 02) 

where the last inequality can be obtained by writing the optimality conditions. More precisely, two cases are 
considered depending on whether holds J2i=i u i = k at the optimum: when it is the case, the maximum is 
achieved for u of the form u = (1, . . . , 1, 0, . . . , 0) T ; otherwise, u = (0, . . . , 0) T achieves the maximum. 
The desired results are then obtained by combining ( pT| , ( p~2] > and an upper bound on Yli=i indeed, 
under the assumption, we have Y^t=i ^i,t _• d. Under the Li assumption, since iS is an almost symmetric 
set of order k, there exists z E Conv(<S) n [£; l] d , and consequently ^i,t — Si=i (j^ z i)^i,t _• tt< 

Proof of Theorem 15

We have again D F (u,Wi) < rj ^i) ■ Since we have w,^ t £f t < 1, Theoremjsjimplies 

Proof of Theorem 16

We have again D F (u,Wi) < ^ly- From E(u^*I? t ) = E(tu? t ^? t ) < E[(w iit £ i>t )i] and Theorem 
we get 

~ r,(q-l) 2 LV *•* My J " 77(9-1) 2 

where we use J2i=i( w i,tti,t)° < ( £? =1 Wi,di,t) " x ? in the last step. 
Proof of Theorem 17

Let q i<t = Y, ve s:v i =iPv,t = E v t ~ Pt Kt fori £{l...,d}. We have 



v£S 



^Vt~jH,V{~p t — n V i,t V j,t 

77 ii,tqj,t 



2 

■2 



Using $\log(|S|) \le d \log 2$, the result then follows from Theorem 3.

Proof of Theorem 18

Let q ht = Y,v<eS:v z =i Pv,t = E v t ~ Pt Vij for i e {1, ... , d}. We have 
~rl Qi.tQj.t 

v£S 1,3 ' J ' 

< E ^/E^r^„ 

9i,t <7j,t 

= Ev t E ^ E < Ev t E ^ = E ^ < d. 

2=1 ' J— 1 2—1 ' 2—1 

Using $\log(|S|) \le d \log 2$, the result then follows from Theorem 3.

D Proof of Theorem 21



We consider the bandit game first. We use the notation and adversaries defined in the proof of Theorem 20. We modify these adversaries as follows: at each turn one selects uniformly at random $E_t \in \{1, \dots, d\}$. Then, at time $t$, the losses of all coordinates but $E_t$ are set to 0. This new adversary is clearly in $L_2$. For this new set of adversaries, one has to do only two modifications in the proof of Theorem 20. First, (9) is replaced by:



d/2 

■ ns \ 
sup R n > — } 



li 1 -^ E VUJi,n = *)\ 
1 \ a£{l,2} d / 2 / 



Second B Wtl is now a Bernoulli with mean n t G \h + % , \ + § ] and B' w is a Bernoulli with mean [i t — §, 
and thus we have 



4e 2 



(1 -e 2 )^ 

The proof is then concluded again with straightforward computations. 

The proof for the full information game is exactly the same as the one for bandit information, except that 
the definition of W± is slightly different and implies that B Wt _ 1 is now a Bernoulli with mean h (| + e) and 
B' w is a Bernoulli with mean gk, which gives 

4e 2 
E Technical lemmas

We prove here two technical lemmas that were used in the proofs above.

Lemma 23 For any $k \in \mathbb{N}^*$, for any $1 \le c \le 2$, we have

$$\frac{\sum_{i=0}^{k} (1 - i/k) \binom{k}{i}^2 c^i}{\sum_{i=0}^{k} \binom{k}{i}^2 c^i} \ge 1/3.$$



Proof Let $f(c)$ denote the left-hand side term of the inequality. Introduce the random variable $X$, which is equal to $i \in \{0, \dots, k\}$ with probability $\binom{k}{i}^2 c^i / \sum_{j=0}^{k} \binom{k}{j}^2 c^j$. We have $f'(c) = \frac{1}{c} \mathbb{E}[X (1 - X/k)] - \frac{1}{c} \mathbb{E}(X) \mathbb{E}(1 - X/k) = -\frac{1}{ck} \mathrm{Var}\, X \le 0$. So the function $f$ is decreasing on $[1, 2]$, and, from now on, we consider $c = 2$. Numerator and denominator of the left-hand side (l.h.s.) differ only by the $1 - i/k$ factor. A lower bound for the left-hand side can thus be obtained by showing that the terms for $i$ close to $k$ are not essential to the value of the denominator. To prove this, we may use the Stirling formula: for any $n \ge 1$

$$\Big(\frac{n}{e}\Big)^n \sqrt{2 \pi n} < n! < \Big(\frac{n}{e}\Big)^n \sqrt{2 \pi n}\, e^{1/(12n)}. \tag{13}$$

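Indeed, this inequality implies that for any $k \ge 2$ and $i \in [1, k-1]$

$$\Big(\frac{k}{i}\Big)^{i} \Big(\frac{k}{k-i}\Big)^{k-i} \frac{\sqrt{k}\, e^{-1/6}}{\sqrt{2 \pi i (k-i)}} < \binom{k}{i} < \Big(\frac{k}{i}\Big)^{i} \Big(\frac{k}{k-i}\Big)^{k-i} \frac{\sqrt{k}\, e^{1/12}}{\sqrt{2 \pi i (k-i)}},$$

hence

$$\Big(\frac{k}{i}\Big)^{2i} \Big(\frac{k}{k-i}\Big)^{2(k-i)} \frac{k\, e^{-1/3}}{2 \pi i (k-i)} < \binom{k}{i}^2 < \Big(\frac{k}{i}\Big)^{2i} \Big(\frac{k}{k-i}\Big)^{2(k-i)} \frac{k\, e^{1/6}}{2 \pi i}.$$

Introduce $\lambda = i/k$ and $\chi(\lambda) = \frac{2^{\lambda}}{\lambda^{2\lambda} (1-\lambda)^{2(1-\lambda)}}$. We have

$$[\chi(\lambda)]^k \frac{2 e^{-1/3}}{\pi k} < \binom{k}{i}^2 2^i < [\chi(\lambda)]^k \frac{e^{1/6}}{2 \pi \lambda}. \tag{14}$$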

Lemma 23 can be numerically verified for $k \le 10^6$. We now consider $k > 10^6$. For $\lambda \ge 0.666$, since the function $\chi$ can be shown to be decreasing on $[0.666, 1]$, the inequality $\binom{k}{i}^2 2^i < [\chi(0.666)]^k \frac{e^{1/6}}{2 \times 0.666 \times \pi}$ holds. We have $\chi(0.657)/\chi(0.666) > 1.0002$. Consequently, for $k > 10^6$, we have $[\chi(0.666)]^k < 0.001 \times [\chi(0.657)]^k / k^2$. So for $\lambda \ge 0.666$ and $k > 10^6$, we have

$$\begin{aligned}
\binom{k}{i}^2 2^i &< 0.001 \times [\chi(0.657)]^k \frac{e^{1/6}}{2 \pi \times 0.666 \times k^2} < [\chi(0.657)]^k \frac{2 e^{-1/3}}{1000 \pi k^2} \\
&= \min_{\lambda \in [0.656, 0.657]} [\chi(\lambda)]^k \frac{2 e^{-1/3}}{1000 \pi k^2} \\
&< \frac{1}{1000 k} \max_{i \in \{1, \dots, k-1\} \cap [0, 0.666 k)} \binom{k}{i}^2 2^i,
\end{aligned} \tag{15}$$

where the last inequality comes from (14) and the fact that there exists $i \in \{1, \dots, k-1\}$ such that $i/k \in [0.656, 0.657]$. Inequality (15) implies that

$$\sum_{0.666 k \le i \le k} \binom{k}{i}^2 2^i < \frac{1}{1000} \max_{i \in \{1, \dots, k-1\} \cap [0, 0.666 k)} \binom{k}{i}^2 2^i \le \frac{1}{1000} \sum_{0 \le i < 0.666 k} \binom{k}{i}^2 2^i.$$
To conclude, introducing $A = \sum_{0 \le i < 0.666 k} \binom{k}{i}^2 2^i$, we have

$$\frac{\sum_{i=0}^{k} (1 - i/k) \binom{k}{i}^2 2^i}{\sum_{i=0}^{k} \binom{k}{i}^2 2^i} \ge \frac{(1 - 0.666) A}{A + 0.001 A} \ge \frac{1}{3}. \quad ■$$
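The numerical verification mentioned in the proof can be sketched with exact rational arithmetic, which avoids any floating-point doubt at the boundary (the case $k = 1$, $c = 2$ gives exactly $1/3$). The code below is an illustration for small $k$ only, not the verification up to $10^6$ used in the proof; the name `lemma23_ratio` is hypothetical.

```python
from fractions import Fraction
from math import comb

def lemma23_ratio(k, c):
    """Exact value of sum_i (1 - i/k) C(k,i)^2 c^i / sum_i C(k,i)^2 c^i."""
    c = Fraction(c)
    weights = [comb(k, i) ** 2 * c ** i for i in range(k + 1)]
    numerator = sum((1 - Fraction(i, k)) * w for i, w in enumerate(weights))
    return numerator / sum(weights)

# Since the ratio is decreasing in c on [1, 2], checking c = 2 is enough.
smallest = min(lemma23_ratio(k, 2) for k in range(1, 60))
```

For larger $k$ the same sum would need logarithmic or floating-point arithmetic to stay fast; that variant is not shown here.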



Lemma 24 Let $\ell$ and $n$ be integers with $\frac{1}{2} \le \frac{n}{2} \le \ell \le n$. Let $p, p', q, p_1, \dots, p_n$ be real numbers in $(0, 1)$ with $q \in \{p, p'\}$, $p_1 = \dots = p_\ell = q$ and $p_{\ell+1} = \dots = p_n$. Let $B$ (resp. $B'$) be the sum of $n + 1$ independent Bernoulli distributions with parameters $p, p_1, \dots, p_n$ (resp. $p', p_1, \dots, p_n$). We have

Proof Let $Z, Z', Z_1, \dots, Z_n$ be independent Bernoulli distributions with parameters $p, p', p_1, \dots, p_n$. Define $S = \sum_{i=1}^{\ell} Z_i$, $T = \sum_{i=\ell+1}^{n} Z_i$ and $V = Z + S$. By slight abuse of notation, merging in the same notation the distribution and the random variable, we have

$$\begin{aligned}
\mathrm{KL}(B, B') &= \mathrm{KL}\big( (Z + S) + T, (Z' + S) + T \big) \\
&\le \mathrm{KL}\big( (Z + S, T), (Z' + S, T) \big) \\
&= \mathrm{KL}(Z + S, Z' + S).
\end{aligned}$$

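Let $s_k = \mathbb{P}(S = k)$ for $k = -1, 0, \dots, \ell + 1$. Using the equalities

$$s_k = \binom{\ell}{k} q^k (1-q)^{\ell-k} = \frac{q}{1-q} \, \frac{\ell - k + 1}{k} \binom{\ell}{k-1} q^{k-1} (1-q)^{\ell-k+1} = \frac{q}{1-q} \, \frac{\ell - k + 1}{k} \, s_{k-1},$$

which hold for $1 \le k \le \ell + 1$, we obtain

$$\begin{aligned}
\mathrm{KL}(Z + S, Z' + S) &= \sum_{k=0}^{\ell+1} \mathbb{P}(V = k) \log\left( \frac{\mathbb{P}(Z + S = k)}{\mathbb{P}(Z' + S = k)} \right) \\
&= \sum_{k=0}^{\ell+1} \mathbb{P}(V = k) \log\left( \frac{p\, s_{k-1} + (1-p)\, s_k}{p'\, s_{k-1} + (1-p')\, s_k} \right) \\
&= \sum_{k=0}^{\ell+1} \mathbb{P}(V = k) \log\left( \frac{p \frac{1-q}{q} k + (1-p)(\ell - k + 1)}{p' \frac{1-q}{q} k + (1-p')(\ell - k + 1)} \right) \\
&= \mathbb{E} \log\left( \frac{(p - q) V + (1-p)\, q\, (\ell + 1)}{(p' - q) V + (1-p')\, q\, (\ell + 1)} \right).
\end{aligned} \tag{16}$$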

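First case: $q = p'$.

By Jensen's inequality, using that $\mathbb{E} V = p'(\ell + 1) + p - p'$ in this case, we then get

$$\begin{aligned}
\mathrm{KL}(Z + S, Z' + S) &\le \log\left( \frac{(p - p')\, \mathbb{E}(V) + (1-p)\, p' (\ell + 1)}{(1-p')\, p' (\ell + 1)} \right) \\
&= \log\left( \frac{(p - p')^2 + (1-p')\, p' (\ell + 1)}{(1-p')\, p' (\ell + 1)} \right) \\
&= \log\left( 1 + \frac{(p - p')^2}{(1-p')\, p' (\ell + 1)} \right) \le \frac{(p - p')^2}{(1-p')\, p' (\ell + 1)}.
\end{aligned}$$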
Second case: $q = p$.

In this case, $V$ is a binomial distribution with parameters $\ell + 1$ and $p$. From (16), we have

$$\mathrm{KL}(Z + S, Z' + S) \le -\mathbb{E} \log\left( \frac{(p' - p) V + (1-p')\, p\, (\ell + 1)}{(1-p)\, p\, (\ell + 1)} \right). \tag{17}$$
To conclude, we will use the following lemma.

Lemma 25 The following inequality holds for any $x \ge x_0$ with $x_0 \in (0, 1)$:

$$-\log(x) \le -(x - 1) + \frac{(x - 1)^2}{2 x_0}.$$

Proof Introduce $f(x) = -(x - 1) + \frac{(x-1)^2}{2 x_0} + \log(x)$. We have $f'(x) = -1 + \frac{x - 1}{x_0} + \frac{1}{x}$, and $f''(x) = \frac{1}{x_0} - \frac{1}{x^2}$. From $f'(x_0) = 0$, we get that $f'$ is negative on $(x_0, 1)$ and positive on $(1, +\infty)$. Since $f(1) = 0$, this leads to $f$ nonnegative on $[x_0, +\infty)$. ■
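The inequality of Lemma 25 is easy to probe numerically. The sketch below evaluates the function $f(x) = -(x-1) + \frac{(x-1)^2}{2x_0} + \log(x)$ from the proof on a grid and checks that it stays nonnegative for $x \ge x_0$; it also shows that the restriction $x \ge x_0$ matters, since $f$ can be negative below $x_0$. The grid and the name `lemma25_gap` are illustrative assumptions, and a grid check is of course not a proof.

```python
import math

def lemma25_gap(x, x0):
    """f(x) = -(x - 1) + (x - 1)**2 / (2 * x0) + log(x); nonnegative for x >= x0."""
    return -(x - 1.0) + (x - 1.0) ** 2 / (2.0 * x0) + math.log(x)

def min_gap_on_grid(x0, x_max=10.0, steps=2000):
    """Smallest value of f over an evenly spaced grid of [x0, x_max]."""
    points = [x0 + (x_max - x0) * j / steps for j in range(steps + 1)]
    return min(lemma25_gap(x, x0) for x in points)

# The gap vanishes at x = 1 and is negative for some x < x0.
gap_at_one = lemma25_gap(1.0, 0.5)
gap_below_x0 = lemma25_gap(0.1, 0.5)
```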



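Finally, from Lemma 25 and (17), using $x_0 = \frac{1-p'}{1-p}$, we obtain

$$\begin{aligned}
\mathrm{KL}(Z + S, Z' + S) &\le \left( \frac{p' - p}{(1-p)\, p\, (\ell + 1)} \right)^2 \frac{\mathbb{E}[(V - \mathbb{E} V)^2]}{2 x_0} \\
&= \left( \frac{p' - p}{(1-p)\, p\, (\ell + 1)} \right)^2 \frac{(\ell + 1)\, p\, (1-p)^2}{2 (1-p')} \\
&= \frac{(p' - p)^2}{2 (1-p') (\ell + 1)\, p}.
\end{aligned}$$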